<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8" >

<title>Feature Engineering - 百面机器学习 | Minority Report</title>
<meta name="description" content="Minority Report">

<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no">

<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.7.2/css/all.css" integrity="sha384-fnmOCqbTlWIlj8LyTjo7mOUStjsKC4pOpQbqyi7RrhN7udi9RwhKkMHpvLbHG9Sr" crossorigin="anonymous">
<link rel="shortcut icon" href="https://www.timegarage.works/favicon.ico?v=1577944220707">
<!-- <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.10.0/katex.min.css"> -->
<link rel="stylesheet" href="https://www.timegarage.works/styles/main.css">


  
    <link rel="stylesheet" href="https://unpkg.com/gitalk/dist/gitalk.css" />
  

  


<script src="https://cdn.jsdelivr.net/npm/vue/dist/vue.js"></script>
<script src="https://cdn.bootcss.com/highlight.js/9.12.0/highlight.min.js"></script>

<link rel="stylesheet" href="https://unpkg.com/aos@next/dist/aos.css" />



  </head>
  <script async src="//busuanzi.ibruce.info/busuanzi/2.3/busuanzi.pure.mini.js"></script>
  <body>
    <div id="app" class="main">
      <div class="sidebar" :class="{ 'full-height': menuVisible }">
  <div class="top-container">
    <div class="top-header-container">
      <a class="site-title-container" href="https://www.timegarage.works">
        <img src="https://www.timegarage.works/images/avatar.png?v=1577944220707" class="site-logo">
        <h1 class="site-title">Minority Report</h1>
      </a>
      <div class="menu-btn" @click="menuVisible = !menuVisible">
        <div class="line"></div>
      </div>
    </div>
    <div>
      
        
          <a href="/" class="site-nav">
            Home
          </a>
        
      
        
          <a href="/archives" class="site-nav">
            Archives
          </a>
        
      
        
          <a href="/tags" class="site-nav">
            Tags
          </a>
        
      
        
          <a href="/post/Portfolio/" class="site-nav">
            Portfolio
          </a>
        
      
        
          <a href="/post/about/" class="site-nav">
            About
          </a>
        
      
    </div>
  </div>
  <div class="bottom-container">
    <div class="social-container">
      
        
      
        
      
        
      
        
      
        
      
    </div>
    <div class="site-description">
      Minority Report
    </div>
    <div class="site-footer">
      Copyright © 2020 TimeGarage Inc. | <a class="rss" href="https://www.timegarage.works/atom.xml" target="_blank">RSS</a>
    </div>
  </div>
</div>

      <div class="main-container">
        <div class="content-container" data-aos="fade-in">
          <div class="post-detail">
            <h2 class="post-title">Feature Engineering - 百面机器学习</h2>
            <div class="post-date">
              2019-11-15 
              <span id="busuanzi_container_page_pv">
                  &emsp;📖<span id="busuanzi_value_page_pv"></span>
              </span>
            </div>
            
            <div class="post-content">
<p>Feature engineering in practice</p>
<!-- more -->
<h1 id="特征工程">Feature Engineering</h1>
<h3 id="特征归一化">Feature Normalization</h3>
<ul>
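<li>Purpose: remove the effect of differing units and scales so that features are comparable</li>
<li>Normalization methods for numerical features
<ul>
<li>Min-max scaling (linear function normalization)</li>
<li>Z-score normalization (zero mean, unit variance)</li>
</ul>
</li>
<li>Effect of normalization on the convergence speed of gradient descent</li>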
</ul>
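<p>The two normalization methods listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names and the sample <code>heights_cm</code> values are made up for the example, and the population standard deviation is an assumed choice.</p>
<pre><code class="language-python">from statistics import mean, pstdev


def min_max_scale(values):
    """Linearly map values onto [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant feature: nothing to rescale, so map everything to 0.
        return [0.0 for _ in values]
    return [(x - lo) / span for x in values]


def z_score_scale(values):
    """Shift to mean 0 and scale to standard deviation 1: (x - mu) / sigma."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed choice)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical sample feature
print(min_max_scale(heights_cm))
print(z_score_scale(heights_cm))
</code></pre>
<p>Both functions leave the ordering of the values unchanged and only alter their scale, which is what matters for scale-sensitive optimizers such as gradient descent.</p>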
<h3 id="类别型特征">Categorical Features</h3>
<ul>
<li>
<p>Ordinal encoding</p>
<blockquote>
<p>Ordinal encoding is typically used for categories that have a natural order.</p>
</blockquote>
</li>
<li>
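<p>One-hot encoding</p>
<blockquote>
<p>One-hot encoding is typically used for features whose categories have no natural order.</p>
</blockquote>
<ul>
<li>Use sparse vectors to save space</li>
<li>Combine with feature selection to reduce dimensionality
<ul>
<li>Problems caused by high-dimensional features
<ul>
<li>With KNN, distances between points become hard to measure meaningfully</li>
<li>With LR, the number of parameters grows, which makes overfitting more likely</li>
<li>Only some of the dimensions actually help classification or prediction</li>
</ul>
</li>
</ul>
</li>
</ul>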
</li>
<li>
<p>Binary encoding</p>
</li>
<li>
<p>Other encodings</p>
<ul>
<li>Helmert Contrast</li>
<li>Sum Contrast</li>
<li>Polynomial Contrast</li>
<li>Backward Difference Contrast</li>
</ul>
</li>
</ul>
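<p>As a rough sketch of the first three encodings above, the snippet below encodes a single categorical value in each style. The helper names, the <code>grades</code> and <code>blood_types</code> lists, and the choice of 1-based IDs are assumptions made for illustration only.</p>
<pre><code class="language-python">def ordinal_encode(value, ordered_levels):
    """Map an ordered category to its 1-based rank, preserving the order."""
    return ordered_levels.index(value) + 1


def one_hot_encode(value, categories):
    """Dense one-hot vector: a single 1 at the category's position."""
    return [1 if c == value else 0 for c in categories]


def one_hot_sparse(value, categories):
    """Sparse form: keep only the vector length and the index of the 1."""
    return (len(categories), categories.index(value))


def binary_encode(value, categories):
    """Assign a 1-based ID, then write that ID in fixed-width binary."""
    width = len(categories).bit_length()
    category_id = categories.index(value) + 1
    return [int(bit) for bit in format(category_id, "0{}b".format(width))]


grades = ["low", "medium", "high"]   # hypothetical ordered categories
blood_types = ["A", "B", "AB", "O"]  # hypothetical unordered categories

print(ordinal_encode("high", grades))
print(one_hot_encode("B", blood_types))
print(one_hot_sparse("O", blood_types))
print(binary_encode("AB", blood_types))
</code></pre>
<p>With four categories the one-hot vector needs four dimensions while the binary code needs three, and the gap widens as the number of categories grows.</p>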
<h3 id="高维组合特征的处理">Handling High-Dimensional Combined Features</h3>
<ul>
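<li>Purpose: improve the model's ability to fit complex relationships</li>
<li>Combination (feature crossing)</li>
<li>Decomposition
<ul>
<li>Matrix factorization</li>
</ul>
</li>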
</ul>
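<p>To see why decomposition helps, consider crossing two ID-like features with <em>m</em> and <em>n</em> distinct values. The sketch below compares the parameter count of a directly crossed feature with a factorized one in which each value gets a <em>k</em>-dimensional vector. The user/item framing, the sizes, and <em>k</em> = 16 are assumptions chosen for illustration, not figures from the notes.</p>
<pre><code class="language-python">def crossed_param_count(m, n):
    """One weight per (value_i, value_j) pair of the crossed feature."""
    return m * n


def factorized_param_count(m, n, k):
    """One k-dimensional vector per value of each original feature."""
    return (m + n) * k


def crossed_weight(vec_i, vec_j):
    """Weight of one crossed pair, rebuilt as the dot product of two vectors."""
    return sum(a * b for a, b in zip(vec_i, vec_j))


# Hypothetical sizes: 10 million user IDs crossed with 100 thousand item IDs.
m, n, k = 10_000_000, 100_000, 16
print(crossed_param_count(m, n))
print(factorized_param_count(m, n, k))
print(crossed_weight([1.0, 2.0], [3.0, 4.0]))
</code></pre>
<p>The factorized form trades an exact weight per pair for a low-rank approximation, which is the same idea matrix factorization uses in recommender systems.</p>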
<h3 id="组合特征">Combined Features</h3>
<ul>
<li>Finding feature combinations with decision trees
<ul>
<li>Gradient boosting decision trees (GBDT)</li>
</ul>
</li>
</ul>
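<p>One common way to read feature combinations off a decision tree is to treat every root-to-leaf path as one combination and to encode a sample by the paths it satisfies. The sketch below assumes that reading and skips training entirely: <code>TREE</code> is a hand-written stand-in for a single tree of a trained ensemble, and its feature names and values are hypothetical.</p>
<pre><code class="language-python"># Hand-written stand-in for one tree of a trained ensemble (hypothetical).
# Internal node: (feature_name, expected_value, subtree_if_equal, subtree_if_not)
# Leaf: None
TREE = ("age", "under_30",
        ("gender", "female", None, None),
        None)


def enumerate_paths(node, prefix=()):
    """Return every root-to-leaf path as a tuple of (feature, value, is_equal) tests."""
    if node is None:
        return [prefix]
    feature, value, if_equal, if_not = node
    return (enumerate_paths(if_equal, prefix + ((feature, value, True),))
            + enumerate_paths(if_not, prefix + ((feature, value, False),)))


def encode(sample, paths):
    """One indicator per path: 1 if the sample passes every test on that path."""
    return [int(all((sample[f] == v) == eq for f, v, eq in path))
            for path in paths]


paths = enumerate_paths(TREE)
sample = {"age": "under_30", "gender": "male"}  # hypothetical sample
print(paths)
print(encode(sample, paths))
</code></pre>
<p>Because the paths of a single tree are mutually exclusive, exactly one indicator per tree is set. With several trees, the per-tree vectors would be concatenated.</p>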
<h3 id="文本表示模型">Text Representation Models</h3>
<blockquote>
<p>Text is a very important kind of unstructured data.</p>
</blockquote>
<ul>
<li>
<p>Bag-of-words and N-gram models</p>
<ul>
<li>
<p>TF-IDF</p>
<ul>
<li>
<p>TF-IDF(<em>t</em>,<em>d</em>)=TF(<em>t</em>,<em>d</em>)×IDF(<em>t</em>)</p>
<blockquote>
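<p>Here TF(<em>t</em>,<em>d</em>) is the frequency of word <em>t</em> in document <em>d</em>, and IDF(<em>t</em>) is the inverse document frequency, which measures how important word <em>t</em> is for expressing meaning.</p>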
</blockquote>
</li>
</ul>
</li>
<li>
<p>N-gram</p>
<blockquote>
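<p>Typically, a group of <em>n</em> consecutive words (<em>n</em>≤<em>N</em>), called an N-gram, can also be added to the vector representation as a separate feature, which gives the N-gram model.</p>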
</blockquote>
</li>
<li>
<p>Word stemming</p>
<blockquote>
<p>Reduce the different inflected forms of a word to the same stem.</p>
</blockquote>
</li>
</ul>
</li>
<li>
<p>Topic models</p>
<ul>
<li>Topic distribution characteristics</li>
</ul>
</li>
<li>
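<p>Word embeddings and deep learning models</p>
<ul>
<li>Deep learning -&gt; automatic feature engineering</li>
<li>CNNs and RNNs model text more effectively and can extract higher-level semantic features.</li>
</ul>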
</li>
</ul>
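<p>The bag-of-words, N-gram, and TF-IDF ideas above can be combined in a short sketch. The notes do not give a formula for IDF, so the variant used here, log(total documents / (documents containing the term + 1)), is an assumption. The tiny corpus and the whitespace tokenization are also made up for illustration.</p>
<pre><code class="language-python">import math
from collections import Counter


def ngrams(tokens, max_n):
    """All contiguous word groups of length 1..max_n, joined with spaces."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tf(term, doc_terms):
    """Frequency of a term within one document (count / document length)."""
    return Counter(doc_terms)[term] / len(doc_terms)


def idf(term, corpus):
    """Assumed variant: log(total documents / (documents containing term + 1))."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (containing + 1))


def tf_idf(term, doc_terms, corpus):
    """TF-IDF(t, d) = TF(t, d) * IDF(t)."""
    return tf(term, doc_terms) * idf(term, corpus)


texts = ["natural language processing", "language model",
         "image processing", "deep learning"]  # hypothetical corpus
corpus = [ngrams(text.split(), 2) for text in texts]

print(corpus[0])
print(tf_idf("natural language", corpus[0], corpus))
print(tf_idf("language", corpus[0], corpus))
</code></pre>
<p>In this toy corpus the bigram "natural language" appears in one document while "language" appears in two, so the bigram receives the larger weight.</p>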
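<h3 id="word2vec">Word2Vec</h3>
<blockquote>
<p>CBOW predicts the probability of the current word from the words in its context, whereas Skip-gram predicts the probability of each context word from the current word.</p>
</blockquote>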
<ul>
<li>Network structure
<ul>
<li>Continuous Bag of Words (CBOW)</li>
<li>Skip-gram</li>
<li>Layers
<ul>
<li>Input layer -&gt; each word is represented as a one-hot vector</li>
<li>Projection layer</li>
<li>Output layer</li>
</ul>
</li>
</ul>
</li>
</ul>
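<p>The three layers can be traced with a forward pass of CBOW in plain Python. This is only a sketch under stated assumptions: the weights <code>w_in</code> and <code>w_out</code> are arbitrary made-up numbers rather than trained values, the projection layer sums the context projections (averaging is a common alternative), and the output layer is a full softmax with none of the usual speed-ups.</p>
<pre><code class="language-python">import math


def one_hot(index, size):
    """Input-layer representation of one word."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_times_matrix(vector, matrix):
    """Row vector times matrix: result[j] = sum_i vector[i] * matrix[i][j]."""
    return [sum(vector[i] * matrix[i][j] for i in range(len(matrix)))
            for j in range(len(matrix[0]))]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_forward(context_ids, w_in, w_out):
    vocab_size = len(w_in)
    # Input layer: one one-hot vector per context word.
    inputs = [one_hot(i, vocab_size) for i in context_ids]
    # Projection layer: project each input with w_in and sum the results.
    hidden = [0.0] * len(w_in[0])
    for x in inputs:
        hidden = [h + p for h, p in zip(hidden, vec_times_matrix(x, w_in))]
    # Output layer: score every vocabulary word, then apply softmax.
    return softmax(vec_times_matrix(hidden, w_out))


# Hypothetical weights for a 4-word vocabulary and 2-dimensional embeddings.
w_in = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]   # vocab x dim
w_out = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]      # dim x vocab
print(cbow_forward([0, 2], w_in, w_out))
</code></pre>
<p>Multiplying a one-hot vector by <code>w_in</code> simply selects one row, which is why the rows of that matrix serve as the word vectors after training.</p>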
<h3 id="图像数据不足时的处理方法">Handling Insufficient Image Data</h3>
<ul>
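<li>Model-based methods -&gt; reduce the risk of overfitting
<ul>
<li>Simplify the model</li>
<li>Add constraint terms (such as L1/L2 regularization)</li>
<li>Ensemble learning</li>
<li>Dropout</li>
</ul>
</li>
<li>Data-based methods
<ul>
<li>Data augmentation
<ul>
<li>Random rotation, translation, scaling, cropping, padding, and horizontal flipping</li>
<li>Adding noise</li>
<li>Color transformations</li>
<li>Changing brightness, clarity, contrast, and sharpness</li>
</ul>
</li>
<li>Transfer learning</li>
</ul>
</li>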
</ul>
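<p>A few of the augmentation operations above can be sketched on a grayscale image stored as a nested list of 0..255 integers. The function names, parameter ranges, and the tiny sample image are assumptions for illustration. A real pipeline would normally rely on an image library instead.</p>
<pre><code class="language-python">import random


def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def translate_right(image, shift, fill=0):
    """Shift pixels right by `shift` columns, padding the left edge with `fill`."""
    width = len(image[0])
    return [([fill] * shift + row)[:width] for row in image]


def adjust_brightness(image, factor):
    """Scale every pixel by `factor`, clamping to the 0..255 range."""
    return [[max(0, min(255, round(p * factor))) for p in row] for row in image]


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to every pixel, clamping to 0..255."""
    return [[max(0, min(255, round(p + rng.gauss(0, sigma)))) for p in row]
            for row in image]


def augment(image, rng):
    """Chain randomly parameterized transforms to produce one new sample."""
    out = image
    if rng.choice([True, False]):
        out = flip_horizontal(out)
    out = translate_right(out, rng.randint(0, 1))
    out = adjust_brightness(out, rng.uniform(0.8, 1.2))
    return add_gaussian_noise(out, 5.0, rng)


image = [[10, 50, 90], [130, 170, 210]]  # hypothetical 2x3 grayscale image
rng = random.Random(0)                   # fixed seed so runs are repeatable
print(augment(image, rng))
</code></pre>
<p>Each call to <code>augment</code> yields a slightly different image with the same label, which is how augmentation enlarges a small training set without collecting new data.</p>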

            </div>
            
              <div class="tag-container">
                
                  <a href="https://www.timegarage.works/tag/BMJQXX" class="tag">
                    百面机器学习
                  </a>
                
                  <a href="https://www.timegarage.works/tag/Notes" class="tag">
                    Reading Notes
                  </a>
                
              </div>
            
            

            
              
                <div id="gitalk-container" data-aos="fade-in"></div>
              

              
            
          </div>
        </div>
      </div>
    </div>

    <script src="https://unpkg.com/aos@next/dist/aos.js"></script>
<script type="application/javascript">

AOS.init();

hljs.initHighlightingOnLoad()

var app = new Vue({
  el: '#app',
  data: {
    menuVisible: false,
  },
})

</script>



  
    <script src="https://unpkg.com/gitalk/dist/gitalk.min.js"></script>
    <script>

      var gitalk = new Gitalk({
        clientID: '447058976c8bd4fe04f6',
        clientSecret: 'b4a632203857b02425883cbf9b5bea3d5a2f86b0',
        repo: 'timegarage.github.io',
        owner: 'TimeGarage',
        admin: ['TimeGarage'],
        id: (location.pathname).substring(0, 49),      // Ensure uniqueness and length less than 50
        distractionFreeMode: false  // Facebook-like distraction free mode
      })

      gitalk.render('gitalk-container')

    </script>
  

  




  </body>
</html>
